Abstract. We introduce an algorithm for the alignment of protein-coding
sequences accounting for frameshifts. The main specificity of this
algorithm as compared to previously published protein-coding sequence
alignment methods is the introduction of a penalty cost for frameshift
extensions. Previous algorithms have only used constant frameshift
penalties. This is similar to the use of scoring schemes with affine gap
penalties in classical sequence alignment algorithms. However, the overall
penalty of a frameshift portion in an alignment cannot be formulated as an
affine function, because it should also incorporate varying codon
substitution scores. The second specificity of the algorithm is that its
search space is the set of all possible alignments between two coding
sequences, under the classical definition of an alignment between two DNA
sequences. Previous algorithms have introduced constraints on the length
of the alignments, and additional symbols for the representation of
frameshift openings in an alignment. The algorithm has the same asymptotic
space and time complexity as the classical Needleman-Wunsch algorithm.
1 Introduction and motivation

Comparative genomics is currently facing a huge challenge with the revelation
of a growing number of genes having multiple alternative coding sequences in
several species. The various coding sequences arising from the same gene or
from homologous genes differ not only by mutations in the nucleotide
sequences, but also by alternative start codons and alternative splicing of
exons. All these mechanisms often induce translation frameshifts that lead to
different translations of the same portion of a gene in distinct coding
sequences [8]. This new insight into the complexity of gene architecture
evolution calls for novel algorithms for the comparison of coding sequences
that are able to account for the presence of translation frameshifts between
coding sequences.

The problem of aligning two coding sequences is an optimization problem
that consists in finding an optimal score alignment in a set of alignments
between the two sequences. A coding sequence is a DNA sequence composed of a
succession of words of length 3 called codons. An alignment between two DNA
sequences A and B is a pair of sequences A' and B' of the same length L on the
alphabet of nucleotides augmented with the gap symbol '-', such that A' and B'
do not contain a gap symbol at the same position, and A and B can be derived
from A' and B' by removing all the gap symbols. The length L of A' and B' is
called the length of the alignment. A translation frameshift in an alignment
between two coding sequences is caused by i) the deletion of one or two
nucleotides of a codon (for example, a codon ACC aligned with A--), or ii) the
insertion of nucleotides between two nucleotides of a codon (for example, a
codon A--CC aligned with AGACC). The computation of an optimal alignment
between two coding sequences should account for both the translation of the
coding sequences into protein sequences, and the presence of translation
frameshifts between the two coding sequences.

A classical approach for comparing two coding sequences consists in a
three-step method, where coding sequences are first translated into protein
sequences, next protein sequences are aligned, and finally the protein
alignment is back-translated to a coding sequence alignment. This approach is
used in most tools for multiple alignment of coding sequences. However, it is
not able to account for the presence of frameshifts between coding sequences.
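The last step of this three-step method can be sketched as follows. This is a minimal illustration, not the code of any of the cited tools: the function name `back_translate` and its arguments are hypothetical, and the protein alignment is assumed to be given as rows that use '-' as the gap symbol.

```python
def back_translate(aligned_protein: str, coding_seq: str, gap: str = "-") -> str:
    """Thread the codons of coding_seq onto one row of a protein alignment.

    Each amino acid of the aligned row is replaced by the next codon of the
    coding sequence, and each protein gap by three nucleotide gaps.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(coding_seq) != 3 * n_residues:
        raise ValueError("coding sequence must hold exactly one codon per residue")
    codons = iter([coding_seq[p:p + 3] for p in range(0, len(coding_seq), 3)])
    return "".join(gap * 3 if aa == gap else next(codons) for aa in aligned_protein)


# Two rows of a toy protein alignment and the corresponding coding sequences.
row_a = back_translate("MT-K", "ATGACCAAG")
row_b = back_translate("M-EK", "ATGGAGAAG")
```

Because every protein gap becomes exactly three nucleotide gaps, the resulting rows can only contain gaps whose lengths are multiples of 3 placed at codon boundaries, which is why this approach cannot represent a frameshift.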

The problem of aligning two coding sequences of lengths n and m while
accounting for both the corresponding protein sequences and the presence of
frameshifts was first addressed by Hein et al. They proposed a DNA/protein
model such that the score of an alignment between two coding sequences is
a combination of its score at the DNA level and its score at the protein level.
Under this model, an $O(n^2 m^2)$ algorithm and then an $O(nm)$ algorithm [3]
were proposed to compute an optimal score alignment. The search space of these
algorithms is the set of alignments that can each be uniquely decomposed into
a succession of sub-alignments of eleven (11) types. The eleven types of
sub-alignment are defined such that the length of each of them is a multiple
of 3. Thus, the total length of any alignment in the search space is always a
multiple of 3, and the score of an alignment is the sum of the scores of its
sub-alignments.

Arvestad [5] proposed another $O(nm)$ protein-coding alignment algorithm
based on the concept of generalized substitutions introduced in earlier work.
In this algorithm, an alignment between two coding sequences A and B is a pair
of sequences on the alphabet of nucleotides augmented with the gap symbol '-'
and the frameshift symbol '!'. The search space of the algorithm is the set of
alignments that are each composed of a succession of sub-alignments of length 3
such that each sub-alignment is an alignment between two codon fragments of A
and B. A codon fragment of a coding sequence S is defined as a word of length
0 to 5 in S. If a codon fragment has a length of 4 (resp. 5), then one (resp.
two) nucleotides in the codon fragment are dropped in order to fit in a
sub-alignment of length 3. Such dropped nucleotides are simply ignored in the
definition of the score of a length-3 sub-alignment. If a codon fragment has a
length of 1 or 2, then two or one frameshift opening symbols '!' are added in
the codon in order to fit in a sub-alignment of length 3. The score of an
alignment is then defined as the sum of the scores of its length-3
sub-alignments.

More recently, Ranwez et al. proposed a simplification of the model of
Arvestad [2] where a codon fragment of a coding sequence S is defined as a
word of length 0 to 3 in S. Thus, no supplemental combinatorics are required
in order to consider all the possibilities of dropping one or two nucleotides
from a codon fragment of length 4 or 5. The algorithm has a complexity in
$O(nm)$. This method was extended in the context of multiple protein-coding
sequence alignment.

The above three methods compare two coding sequences while accounting for the
presence of translation frameshift openings between the two sequences. A
frameshift in an alignment is penalized by adding a constant frameshift cost,
which only penalizes the initiation of a frameshift, not accounting for the
extension of this frameshift in the alignment.

For example, we consider the following three coding sequences: Seq1, Seq2,
and Seq3. Seq1 has a length of 45. Seq2 (resp. Seq3) has a length of 60 and is
obtained from Seq1 by deleting the nucleotide 'G' at position 30 (resp. the
nucleotide 'G' at position 15) and adding 16 nucleotides at the end.

Seq1: ATGACCGAATCCAAGCAGCCCTGGCATAAGTGGGGGAACGATTGA
      MTESKQPWHKWGND*

Seq2: ATGACCGAATCCAAGCAGCCCTGGCATAATGGGGGAACGATTGAAGTAGGAACGATTTAA
      MTESKQPWHNGGTIEVGTI*

Seq3: ATGACCGAATCCAACAGCCCTGGCATAAGTGGGGGAACGATTGAAGTAGGAACGATTTAA
      MTESNSPGISGGTIEVGTI*
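
The construction of Seq2 and Seq3 from Seq1 can be replayed with a few lines of Python. This is only an illustrative sketch: the helper names `delete_and_extend` and `shared_codon_prefix` are hypothetical, and the 16-nucleotide tail is read off the end of the sequences listed above.

```python
SEQ1 = "ATGACCGAATCCAAGCAGCCCTGGCATAAGTGGGGGAACGATTGA"
TAIL = "AGTAGGAACGATTTAA"  # the 16 nucleotides added at the end


def delete_and_extend(seq: str, position: int, tail: str) -> str:
    """Delete the nucleotide at the 1-based `position`, then append `tail`."""
    return seq[:position - 1] + seq[position:] + tail


def shared_codon_prefix(a: str, b: str) -> int:
    """Count the leading codons (read in frame 0) that are identical in a and b."""
    count = 0
    for p in range(0, min(len(a), len(b)) - 2, 3):
        if a[p:p + 3] != b[p:p + 3]:
            break
        count += 1
    return count


seq2 = delete_and_extend(SEQ1, 30, TAIL)  # deletion in the 10th codon
seq3 = delete_and_extend(SEQ1, 15, TAIL)  # deletion in the 5th codon
```

The later the deletion occurs, the longer the stretch of codons that is still read in the original frame, which is the intuition behind penalizing the length of a frameshift and not only its opening.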

When looking at the translations of Seq1, Seq2 and Seq3, it is easily
observable that Seq2 is more similar to Seq1 than Seq3 is similar to Seq1.
However, the pairwise alignment algorithms accounting for frameshifts described
above would return the same score for the two following optimal alignments of
Seq1 and Seq2, and Seq1 and Seq3, penalizing only the initiation of a
frameshift in both cases (positions colored in red in the alignments).

Optimal alignment between Seq1 and Seq2:
M  T  E  S  K  Q  P  W  H  K  W  G  N  D  *  -  -  -  -  -  -
ATGACCGAATCCAAGCAGCCCTGGCATAAGTGGGGGAACGATTGA------------------
ATGACCGAATCCAAGCAGCCCTGGCATAA-TGGGGGAACGATTGAAGTAGGAACGATTTAA--
M  T  E  S  K  Q  P  W  H  !  W  G  N  D  *  S  R  N  D  L  !

Optimal alignment between Seq1 and Seq3:
M  T  E  S  K  Q  P  W  H  K  W  G  N  D  *  -  -  -  -  -  -
ATGACCGAATCCAAGCAGCCCTGGCATAAGTGGGGGAACGATTGA------------------
ATGACCGAATCCAA-CAGCCCTGGCATAAGTGGGGGAACGATTGAAGTAGGAACGATTTAA--
M  T  E  S  !  Q  P  W  H  K  W  G  N  D  *  S  R  N  D  L  !

We describe a pairwise alignment algorithm that uses a scoring scheme
penalizing both the initiation and the extensions of frameshifts (positions
colored in blue in the alignments). In Section 2, some preliminary definitions
of alignments and the description of the problem are presented. In Section 3,
the new algorithm for computing an optimal score alignment is described.












Belanger-Ouangraoua 


2 Preliminaries : Alignment of protein-coding sequences 

In this section, we formally describe coding sequences and the pairwise
alignment problem that is solved in Section 3.

Definition 1 (Coding sequence). A coding sequence is a DNA sequence on the
alphabet of nucleotides $\Sigma_N = \{a, c, g, t\}$ whose length $n$ is a
multiple of 3. A coding sequence is composed of a succession of $\frac{n}{3}$
codons that are the words of length 3 in the sequence ending at positions
$3i$, $1 \le i \le \frac{n}{3}$. The translation of the coding sequence is a
protein sequence of length $\frac{n}{3}$ on the alphabet of amino acids (aa)
such that each codon of the coding sequence is translated into an amino acid
in the protein sequence.
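
Definition 1 can be illustrated with a short translation routine. This is a sketch under stated assumptions: it uses the standard genetic code (encoded with the usual compact TCAG ordering), writes stop codons as '*' as in the examples of Section 1, and the names `CODON_TABLE`, `codons` and `translate` are chosen here for illustration only.

```python
BASES = "TCAG"
# Amino acids of the standard genetic code, listed in TCAG order for each of
# the three codon positions ('*' denotes a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codons(seq: str) -> list:
    """Split a coding sequence into its n/3 codons (words ending at positions 3i)."""
    if len(seq) % 3 != 0:
        raise ValueError("the length of a coding sequence must be a multiple of 3")
    return [seq[p:p + 3] for p in range(0, len(seq), 3)]


def translate(seq: str) -> str:
    """Translate a coding sequence into a protein sequence of length n/3."""
    return "".join(CODON_TABLE[c] for c in codons(seq.upper()))


protein = translate("ATGACCGAATCCAAGCAGCCCTGGCATAAGTGGGGGAACGATTGA")  # Seq1
```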

In this work, the definition of an alignment between two coding sequences is 
exactly the same as the classical definition of an alignment between two DNA 
sequences used by the Needleman-Wunsch algorithm for the comparison of two 
sequences [7]. 

Definition 2 (alignment between DNA sequences). An alignment between
two DNA sequences $A$ and $B$ is a pair $(A', B')$ where $A'$ and $B'$ are two
sequences of the same length $L$ derived by inserting gap symbols '-' in $A$
and $B$, such that $\forall i$, $1 \le i \le L$, $A'[i] \ne$ '-' or
$B'[i] \ne$ '-'. Each position $i$, $1 \le i \le L$, in the alignment is
called a column of the alignment.

Given a sequence $S$ of length $L$ on the alphabet
$\Sigma = \{a, c, g, t, -\}$, $S[k .. l]$, $1 \le k \le l \le L$, denotes the
subsequence of $S$ going from position $k$ to position $l$. $|S[k .. l]|$
denotes the number of letters in $S[k .. l]$ that are different from the gap
symbol '-'. For example, |AC--G| = 3.
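
Definition 2 and the notation above translate directly into two small checks. The following is a minimal sketch; the function names `nongap_len` and `is_alignment` are hypothetical, and positions are 1-based and inclusive as in the text.

```python
GAP = "-"


def nongap_len(s: str, k: int, l: int) -> int:
    """Return |S[k .. l]|: the number of non-gap letters between positions k and l."""
    return sum(1 for c in s[k - 1:l] if c != GAP)


def is_alignment(a: str, b: str, a_row: str, b_row: str) -> bool:
    """Check that (a_row, b_row) is an alignment of a and b in the sense of Definition 2."""
    same_length = len(a_row) == len(b_row)
    no_double_gap = all(x != GAP or y != GAP for x, y in zip(a_row, b_row))
    derived = a_row.replace(GAP, "") == a and b_row.replace(GAP, "") == b
    return same_length and no_double_gap and derived


# The insertion example from Section 1: codon ACC shown as A--CC against AGACC.
ok = is_alignment("ACC", "AGACC", "A--CC", "AGACC")
```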

Given an alignment (A', B') between two coding sequences A and B, a codon 
of A or B is grouped in the alignment if its three nucleotides appear in three 
consecutive columns of the alignment. For example, a codon ACC that appears 
in the alignment as ACC is grouped, while it is not grouped if it appears as A-CC. 

In the following, we give our definition of the score of an alignment between 
two coding sequences A and B. It is based on a partition of the codons of A and 
B into four sets (types): 

The set of Matching codons (M) contains the codons that are grouped in 
the alignment, and aligned exactly with a codon of the other sequence. 

The set of Unmatching codons (U) contains the codons that are grouped 
in the alignment, and aligned with three consecutive nucleotides of the other 
sequence that do not form a codon. 

The set of Deleted/Inserted codons (InDel) contains the codons that are 
grouped in the alignment, and aligned with a succession of 3 gaps. 

All other codons are frameshift codons. Following the definitions and
notations for frameshifts used in earlier work, the set of frameshift codons
can be divided into two sets. The set of frameshift codons caused by
deletions (FS$^-$) contains the codons that are grouped in the alignment, and
are aligned with only one or two nucleotides in the other sequence and some
gap symbols. The set of frameshift codons caused by insertions (FS$^+$)
contains all the codons that are not grouped in the alignment.

The set of Matching nucleotides in frameshift codons (MFS) contains 
all the nucleotides belonging to a frameshift codon, and aligned with a nucleotide 
of the other sequence. 

The substitutions of matching (M) and unmatching (U) codons are scored
using an amino acid scoring function $s_{aa}$, and a fixed frameshift
extension cost denoted by fs_extension_cost is added for each unmatching
codon (U). The insertions/deletions of codons (InDel) are scored by adding a
fixed gap cost denoted by gap_cost for each inserted/deleted codon (InDel).
The alignments of frameshift codon nucleotides (MFS) are scored independently
from each other, using a nucleotide scoring function $s_{an}$. The insertions
or deletions of nucleotides from frameshift codons are responsible for the
initiation of frameshifts. They are then scored by adding a fixed frameshift
opening cost denoted by fs_open_cost for each frameshift codon.

In the following definition of the score of an alignment, the matching (M), 
unmatching (U), and deleted/inserted (InDel) codons of A and B are simply 
identified by the position (column) of their last nucleotide in the alignment. The 
matching nucleotides in frameshift codons (MFS) are also identified by their 
positions in the alignment. 


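Definition 3 (Score of an alignment). Let $(A', B')$ be an alignment of length
$L$ between two coding sequences $A$ and $B$ of lengths $n$ and $m$.

$$
\begin{aligned}
M_{A \to B} &= \{k \le L \mid \exists\, (i,j) \text{ s.t. } A'[k-2 .. k] = A[3i-2 .. 3i] \text{ and } B'[k-2 .. k] = B[3j-2 .. 3j]\} \\
U_{A \to B} &= \{k \le L \mid k \notin M_{A \to B} \text{ and } \exists\, i \text{ s.t. } A'[k-2 .. k] = A[3i-2 .. 3i] \text{ and } |B'[k-2 .. k]| = 3\} \\
InDel_{A \to B} &= \{k \le L \mid \exists\, i \text{ s.t. } A'[k-2 .. k] = A[3i-2 .. 3i] \text{ and } |B'[k-2 .. k]| = 0\} \\
MFS_{A \to B} &= \{k \le L \mid \{k, k+1, k+2\} \cap (M_{A \to B} \cup U_{A \to B} \cup InDel_{A \to B}) = \emptyset \\
&\qquad \text{and } \exists\, (i,j) \text{ s.t. } A'[k] = A[i] \text{ and } B'[k] = B[j]\} \\[4pt]
M_{B \to A} &= \{k \le L \mid \exists\, (j,i) \text{ s.t. } B'[k-2 .. k] = B[3j-2 .. 3j] \text{ and } A'[k-2 .. k] = A[3i-2 .. 3i]\} \\
U_{B \to A} &= \{k \le L \mid k \notin M_{B \to A} \text{ and } \exists\, j \text{ s.t. } B'[k-2 .. k] = B[3j-2 .. 3j] \text{ and } |A'[k-2 .. k]| = 3\} \\
InDel_{B \to A} &= \{k \le L \mid \exists\, j \text{ s.t. } B'[k-2 .. k] = B[3j-2 .. 3j] \text{ and } |A'[k-2 .. k]| = 0\} \\
MFS_{B \to A} &= \{k \le L \mid \{k, k+1, k+2\} \cap (M_{B \to A} \cup U_{B \to A} \cup InDel_{B \to A}) = \emptyset \\
&\qquad \text{and } \exists\, (j,i) \text{ s.t. } B'[k] = B[j] \text{ and } A'[k] = A[i]\}
\end{aligned}
$$

The score of the alignment $(A', B')$ is defined by:

$$
\begin{aligned}
\mathrm{score}(A', B') ={}& \sum_{k \in M_{A \to B}} \frac{s_{aa}(A'[k-2 .. k], B'[k-2 .. k])}{2}
 + \sum_{k \in U_{A \to B}} \left( \frac{s_{aa}(A'[k-2 .. k], B'[k-2 .. k])}{2} + \text{fs\_extension\_cost} \right) \\
&+ |InDel_{A \to B}| * \text{gap\_cost}
 + \left( \tfrac{n}{3} - |M_{A \to B}| - |U_{A \to B}| - |InDel_{A \to B}| \right) * \text{fs\_open\_cost} \\
&+ \sum_{k \in MFS_{A \to B}} \frac{s_{an}(A'[k], B'[k])}{2} \\
&+ \sum_{k \in M_{B \to A}} \frac{s_{aa}(B'[k-2 .. k], A'[k-2 .. k])}{2}
 + \sum_{k \in U_{B \to A}} \left( \frac{s_{aa}(B'[k-2 .. k], A'[k-2 .. k])}{2} + \text{fs\_extension\_cost} \right) \\
&+ |InDel_{B \to A}| * \text{gap\_cost}
 + \left( \tfrac{m}{3} - |M_{B \to A}| - |U_{B \to A}| - |InDel_{B \to A}| \right) * \text{fs\_open\_cost} \\
&+ \sum_{k \in MFS_{B \to A}} \frac{s_{an}(B'[k], A'[k])}{2}
\end{aligned}
$$
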

For example, consider the two following sequences, A containing 13 codons 
and B containing 14 codons, and an alignment of length 48 between them. 

A: ATGACCGAATCCAAGCAGCCCTGGCCAGATCAACGTTGA 
MTESKQPWPDQR* 

B: ATGGAGTCGAAGATCAGCTGGCAGGCCATTGGCAATGACTGA 
MESKISWQAIGND* 

An alignment (A', B') of length 48 between A and B:

pos  000000000111111111122222222223333333333444444444
     123456789012345678901234567890123456789012345678
     M  T  E  S  K    Q  P  W  P  D        Q  R   *
A'   ATGACCGAATCCAAG--CAGCCCTGGCCAG---AT---CAACG-TTGA
B'   ATG---GAGTCGAAGATCAGC--TGG-CAGGCCATTGGCAATGACTGA
     M     E  S  K  I  S    W   Q  A  I  G  N  D  *


The compositions of the different sets of codons and nucleotides used in the
definition of the score of the alignment (A', B') are:
$M_{A \to B} = \{3, 9, 12, 15, 26, 48\}$;
$U_{A \to B} = \{20, 41\}$; $InDel_{A \to B} = \{6\}$;
$MFS_{A \to B} = \{21, 28, 29, 30, 34, 35, 42, 43, 45\}$;
$M_{B \to A} = \{3, 9, 12, 15, 26, 48\}$; $U_{B \to A} = \{21, 30, 42\}$;
$InDel_{B \to A} = \{33\}$; and
$MFS_{B \to A} = \{18, 34, 35, 39, 43, 45\}$.
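
The sets of Definition 3 and the resulting score can be computed directly from the two rows of an alignment. The following Python sketch is one possible reading of the definition, not the authors' implementation: the function names are hypothetical, the rows are assumed to be uppercase strings over {A, C, G, T, -} whose ungapped lengths are multiples of 3, and the scoring functions `s_aa` (on two words of length 3) and `s_an` (on two nucleotides) are supplied by the caller.

```python
GAP = "-"


def codon_columns(row):
    """For each codon of the row, the 1-based columns holding its three nucleotides."""
    cols = [k for k, c in enumerate(row, start=1) if c != GAP]
    return [cols[p:p + 3] for p in range(0, len(cols), 3)]


def codon_sets(x_row, y_row):
    """Return (M, U, InDel, MFS) for direction X -> Y as sets of columns."""
    def grouped(c):
        return c[2] - c[0] == 2  # three consecutive columns

    y_ends = {c[2] for c in codon_columns(y_row) if grouped(c)}
    m, u, indel, covered = set(), set(), set(), set()
    for c in codon_columns(x_row):
        if not grouped(c):
            continue                      # FS+ codon
        k = c[2]                          # codons are identified by their last column
        n_y = sum(1 for p in c if y_row[p - 1] != GAP)
        if k in y_ends:
            m.add(k)                      # aligned exactly with a codon of Y
        elif n_y == 3:
            u.add(k)                      # aligned with 3 nucleotides that form no codon
        elif n_y == 0:
            indel.add(k)                  # aligned with 3 gaps
        else:
            continue                      # FS- codon
        covered.update(c)
    mfs = {k for k in range(1, len(x_row) + 1)
           if k not in covered and x_row[k - 1] != GAP and y_row[k - 1] != GAP}
    return m, u, indel, mfs


def directed_score(x_row, y_row, s_aa, s_an, gap_cost, fs_open_cost, fs_extension_cost):
    """The X -> Y half of the score of Definition 3."""
    m, u, indel, mfs = codon_sets(x_row, y_row)
    n_fs = len(codon_columns(x_row)) - len(m) - len(u) - len(indel)
    total = sum(s_aa(x_row[k - 3:k], y_row[k - 3:k]) / 2 for k in m)
    total += sum(s_aa(x_row[k - 3:k], y_row[k - 3:k]) / 2 + fs_extension_cost for k in u)
    total += len(indel) * gap_cost + n_fs * fs_open_cost
    total += sum(s_an(x_row[k - 1], y_row[k - 1]) / 2 for k in mfs)
    return total


def alignment_score(a_row, b_row, **params):
    """score(A', B') as the sum of the A -> B and B -> A halves."""
    return directed_score(a_row, b_row, **params) + directed_score(b_row, a_row, **params)


A_ROW = "ATGACCGAATCCAAG--CAGCCCTGGCCAG---AT---CAACG-TTGA"
B_ROW = "ATG---GAGTCGAAGATCAGC--TGG-CAGGCCATTGGCAATGACTGA"
sets_ab = codon_sets(A_ROW, B_ROW)
sets_ba = codon_sets(B_ROW, A_ROW)
```

Applied to the alignment (A', B') of the example, `codon_sets` returns the eight sets listed above. Computing the two directions with the same routine keeps the symmetry of the definition explicit.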


3 Algorithm 

In this section, we describe an $O(nm)$ time and space complexity algorithm
that solves the problem of finding a maximum score alignment between two
coding sequences A and B of lengths n and m. Similarly to other sequence
comparison methods, we use dynamic programming tables of size
$(n + 1) \times (m + 1)$ that are indexed by the pairs of prefixes of the two
coding sequences. The table $D$ stores the maximum scores of the alignments
between prefixes of A and B. The table $D_F$ is used to account for potential
cases of frameshift extensions that are counted subsequently.

Definition 4 (Dynamic programming tables). Given two coding sequences
$A$ and $B$ as input, the algorithm uses two dynamic programming tables $D$ and
$D_F$ of size $(n + 1) \times (m + 1)$. The cell $D(i,j)$ contains the maximum
score of an alignment between the prefixes $A[1 .. i]$ and $B[1 .. j]$. The
table $D_F$ is filled only for values of $i$ and $j$ such that
$i \bmod 3 = 0$ or $j \bmod 3 = 0$. If $i \bmod 3 \ne 0$ (resp.
$j \bmod 3 \ne 0$), the cell $D_F(i,j)$ contains the score of an alignment
between the prefixes $A[1 .. i + \alpha]$ and $B[1 .. j + \alpha]$ where
$\alpha = (3 - i) \bmod 3$ (resp. $\alpha = (3 - j) \bmod 3$). The table $D_F$
is filled as follows:

- If $i \bmod 3 = 0$ and $j \bmod 3 = 0$, $D_F(i,j) = D(i,j)$.

- If $i \bmod 3 = 0$ and $j \bmod 3 = 2$, or $i \bmod 3 = 2$ and
  $j \bmod 3 = 0$, $D_F(i,j)$ contains the maximum score of an alignment
  between $A[1 .. i + 1]$ and $B[1 .. j + 1]$ such that $A[i + 1]$ and
  $B[j + 1]$ are aligned together, and half of the score for aligning
  $A[i + 1]$ with $B[j + 1]$ is subtracted.

- If $i \bmod 3 = 0$ and $j \bmod 3 = 1$, or $i \bmod 3 = 1$ and
  $j \bmod 3 = 0$, $D_F(i,j)$ contains the maximum score of an alignment
  between $A[1 .. i + 2]$ and $B[1 .. j + 2]$ such that $A[i + 1], B[j + 1]$
  and $A[i + 2], B[j + 2]$ are aligned together, and half of the scores of
  aligning $A[i + 2]$ with $B[j + 2]$, and $A[i + 1]$ with $B[j + 1]$ is
  subtracted.
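
The indexing rule of Definition 4 can be made concrete with a small helper. This is only an illustrative sketch; the name `df_extent` is hypothetical. It returns the offset α such that the cell D_F(i, j) refers to the prefixes A[1 .. i + α] and B[1 .. j + α], or `None` for cells that are not filled.

```python
def df_extent(i: int, j: int):
    """Offset alpha of cell D_F(i, j) according to Definition 4, or None if unused."""
    if i % 3 != 0 and j % 3 != 0:
        return None          # D_F is filled only when i or j is a multiple of 3
    if i % 3 == 0 and j % 3 == 0:
        return 0             # D_F(i, j) = D(i, j)
    r = i if i % 3 != 0 else j
    return (3 - r) % 3       # 1 when r mod 3 = 2, 2 when r mod 3 = 1


extents = {(i, j): df_extent(i, j) for i in range(4) for j in range(4)}
```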

Lemma 1 (Filling up table D). 

1. If $i \bmod 3 = 0$ and $j \bmod 3 = 0$

$$
D(i,j) = \max \begin{cases}
1.\ s_{aa}(A[i-2 .. i], B[j-2 .. j]) + D(i-3, j-3) \\
2.\ s_{an}(A[i], B[j]) + s_{an}(A[i-1], B[j-1]) + D(i-3, j-2) + 2 * \text{fs\_open\_cost} \\
3.\ s_{an}(A[i], B[j]) + s_{an}(A[i-2], B[j-1]) + D(i-3, j-2) + 2 * \text{fs\_open\_cost} \\
4.\ s_{an}(A[i], B[j]) + D(i-3, j-1) + 2 * \text{fs\_open\_cost} \\
5.\ s_{an}(A[i], B[j]) + s_{an}(A[i-1], B[j-1]) + D(i-2, j-3) + 2 * \text{fs\_open\_cost} \\
6.\ s_{an}(A[i], B[j]) + s_{an}(A[i-1], B[j-2]) + D(i-2, j-3) + 2 * \text{fs\_open\_cost} \\
7.\ s_{an}(A[i], B[j]) + D(i-1, j-3) + 2 * \text{fs\_open\_cost} \\
8.\ s_{an}(A[i], B[j]) + D(i-1, j-1) + 2 * \text{fs\_open\_cost} \\
9.\ \frac{s_{an}(A[i-1], B[j])}{2} + \frac{s_{an}(A[i-2], B[j-1])}{2} + D_F(i-3, j-2) + \text{fs\_open\_cost} \\
10.\ s_{an}(A[i-1], B[j]) + D(i-3, j-1) + 2 * \text{fs\_open\_cost} \\
11.\ \frac{s_{an}(A[i-2], B[j])}{2} + D_F(i-3, j-1) + \text{fs\_open\_cost} \\
12.\ \text{gap\_cost} + D(i-3, j) \\
13.\ D(i-1, j) + \text{fs\_open\_cost} \\
14.\ \frac{s_{an}(A[i], B[j-1])}{2} + \frac{s_{an}(A[i-1], B[j-2])}{2} + D_F(i-2, j-3) + \text{fs\_open\_cost} \\
15.\ s_{an}(A[i], B[j-1]) + D(i-1, j-3) + 2 * \text{fs\_open\_cost} \\
16.\ \frac{s_{an}(A[i], B[j-2])}{2} + D_F(i-1, j-3) + \text{fs\_open\_cost} \\
17.\ \text{gap\_cost} + D(i, j-3) \\
18.\ D(i, j-1) + \text{fs\_open\_cost}
\end{cases}
$$

2. If $i \bmod 3 = 0$ and $j \bmod 3 \ne 0$

$$
D(i,j) = \max \begin{cases}
1.\ \frac{s_{aa}(A[i-2 .. i], B[j-2 .. j])}{2} + D_F(i-3, j-3) + \text{fs\_extension\_cost} \\
2.\ s_{an}(A[i], B[j]) + s_{an}(A[i-1], B[j-1]) + D(i-3, j-2) + \text{fs\_open\_cost} \\
\quad (+\ \text{fs\_open\_cost if } (j-1) \bmod 3 = 0) \\
3.\ s_{an}(A[i], B[j]) + s_{an}(A[i-2], B[j-1]) + D_F(i-3, j-2) + \text{fs\_open\_cost} \\
\quad (-\ \frac{s_{an}(A[i-2], B[j-1])}{2} \text{ if } (j-1) \bmod 3 = 0) \\
4.\ s_{an}(A[i], B[j]) + D(i-3, j-1) + \text{fs\_open\_cost} \\
5.\ s_{an}(A[i], B[j]) + D(i-1, j-1) + \text{fs\_open\_cost} \\
6.\ s_{an}(A[i-1], B[j]) + s_{an}(A[i-2], B[j-1]) + D_F(i-3, j-2) + \text{fs\_open\_cost} \\
\quad (-\ \frac{s_{an}(A[i-2], B[j-1])}{2} \text{ if } (j-1) \bmod 3 = 0) \\
7.\ s_{an}(A[i-1], B[j]) + D(i-3, j-1) + \text{fs\_open\_cost} \\
8.\ s_{an}(A[i-2], B[j]) + D(i-3, j-1) + \text{fs\_open\_cost} \\
9.\ \text{gap\_cost} + D(i-3, j) \\
10.\ D(i-1, j) + \text{fs\_open\_cost} \\
11.\ D(i, j-1)
\end{cases}
$$

3. If $i \bmod 3 \ne 0$ and $j \bmod 3 = 0$, the equation is symmetric to the
previous case.

4. If $i \bmod 3 \ne 0$ and $j \bmod 3 \ne 0$

$$
D(i,j) = \max \begin{cases}
1.\ s_{an}(A[i], B[j]) + D(i-1, j-1) \\
2.\ D(i-1, j) \\
3.\ D(i, j-1)
\end{cases}
$$

[Figure 1: two panels. Case 1, i (mod 3) = 0 and j (mod 3) = 0: configurations 1 to 18. Case 2, i (mod 3) = 0 and j (mod 3) ≠ 0: configurations 1 to 11.]

Fig. 1. Illustration of the configurations of alignment considered in Lemma 1
for computing $D(i,j)$ in cases 1 and 2. The right-most nucleotides of the
sequences $A[1 .. i]$ and $B[1 .. j]$ are represented using the character x.
The nucleotides are colored according to the type of the codon to which they
belong: matching codons (M) in blue, unmatching codons (U) in red,
inserted/deleted codons (InDel) in green, and frameshift codons (FS) in black.
The nucleotides that appear in gray are those belonging to codons whose type
has not yet been decided. In such cases, the table $D_F$ is used in order to
decide the type of these codons later, and adjust the score accordingly.

Proof (Proof of Lemma 1). The principle of the proof is similar to the one for
the alignment of non-coding sequences [7]. For each case, the score $D(i,j)$ is
the maximum score of all possible alignment configurations that are considered
for this case. Here, we only describe the alignment configurations considered
in case 1, where $i \bmod 3 = 0$ and $j \bmod 3 = 0$. A complete proof for all
the cases of the Lemma is given in the Appendix. An illustration of the
different configurations of alignment considered for cases 1 and 2 is shown in
Figure 1.

1. If $i \bmod 3 = 0$ and $j \bmod 3 = 0$, there are three cases depending on
   the alignment of $A[i]$ and $B[j]$.

   (a) If $A[i]$ and $B[j]$ are aligned together, there are four cases
       depending on whether $A[i-2 .. i]$ and $B[j-2 .. j]$ are grouped in the
       alignment or not.

       i. If both $A[i-2 .. i]$ and $B[j-2 .. j]$ are grouped, then
          $A[i-2 .. i]$ and $B[j-2 .. j]$ have to be aligned together, and the
          score of the alignment is:

          1. $s_{aa}(A[i-2 .. i], B[j-2 .. j]) + D(i-3, j-3)$

       ii. If $A[i-2 .. i]$ is grouped while $B[j-2 .. j]$ is not grouped,
          then both $A[i-2 .. i]$ and $B[j-2 .. j]$ are FS codons
          ($A[i-2 .. i]$ is a FS$^-$ codon while $B[j-2 .. j]$ is a FS$^+$
          codon). We add 2 * fs_open_cost to the score of the alignment, and
          the alignment of the nucleotides of the two FS codons can be scored
          independently using the scoring function $s_{an}$. There are two
          cases depending on the number of nucleotides from $B[j-2 .. j]$ that
          are aligned with $A[i-2 .. i]$, two or one:

          A. If $A[i-2 .. i]$ is aligned with two nucleotides, then these
             nucleotides are $B[j-1]$ and $B[j]$. There are two cases
             depending on the alignment of the nucleotide $B[j-1]$ with
             $A[i-1]$ or $A[i-2]$:

             2. $s_{an}(A[i], B[j]) + s_{an}(A[i-1], B[j-1]) + D(i-3, j-2) + 2 * \text{fs\_open\_cost}$

             3. $s_{an}(A[i], B[j]) + s_{an}(A[i-2], B[j-1]) + D(i-3, j-2) + 2 * \text{fs\_open\_cost}$

          B. If $A[i-2 .. i]$ is aligned with one nucleotide, then this single
             nucleotide is $B[j]$, and the score of the alignment is:

             4. $s_{an}(A[i], B[j]) + D(i-3, j-1) + 2 * \text{fs\_open\_cost}$

       iii. If $A[i-2 .. i]$ is not grouped while $B[j-2 .. j]$ is grouped,
          there are three cases that are symmetric to the three cases from
          (a)ii.:

          5. $s_{an}(A[i], B[j]) + s_{an}(A[i-1], B[j-1]) + D(i-2, j-3) + 2 * \text{fs\_open\_cost}$

          6. $s_{an}(A[i], B[j]) + s_{an}(A[i-1], B[j-2]) + D(i-2, j-3) + 2 * \text{fs\_open\_cost}$

          7. $s_{an}(A[i], B[j]) + D(i-1, j-3) + 2 * \text{fs\_open\_cost}$

       iv. If both $A[i-2 .. i]$ and $B[j-2 .. j]$ are not grouped, then again
          both $A[i-2 .. i]$ and $B[j-2 .. j]$ are FS codons (both are FS$^+$
          codons):

          8. $s_{an}(A[i], B[j]) + D(i-1, j-1) + 2 * \text{fs\_open\_cost}$

   (b) If $A[i]$ is aligned with a gap, then the codon $A[i-2 .. i]$ is a FS
       codon (FS$^-$ or FS$^+$). We must add fs_open_cost to the score of the
       alignment. There are two cases depending on whether $A[i-2 .. i]$ is
       grouped in the alignment or not.

       i. If $A[i-2 .. i]$ is grouped, then there are three cases depending on
          the number of nucleotides from $B[j-2 .. j]$ that are aligned with
          $A[i-2 .. i]$, two, one, or zero.

          A. If $A[i-2 .. i]$ is aligned with two nucleotides, then these
             nucleotides are $B[j-1]$ and $B[j]$. The score of the alignment
             is:

             9. $\frac{s_{an}(A[i-1], B[j])}{2} + \frac{s_{an}(A[i-2], B[j-1])}{2} + D_F(i-3, j-2) + \text{fs\_open\_cost}$

          B. If $A[i-2 .. i]$ is aligned with one nucleotide, then this single
             nucleotide is $B[j]$. There are two cases depending on the
             alignment of the nucleotide $B[j]$ with $A[i-1]$ or $A[i-2]$:

             10. $s_{an}(A[i-1], B[j]) + D(i-3, j-1) + 2 * \text{fs\_open\_cost}$

             11. $\frac{s_{an}(A[i-2], B[j])}{2} + D_F(i-3, j-1) + \text{fs\_open\_cost}$

          C. If $A[i-2 .. i]$ is aligned with zero nucleotides, then the codon
             $A[i-2 .. i]$ is entirely deleted. The score of the alignment is:

             12. $\text{gap\_cost} + D(i-3, j)$

       ii. If $A[i-2 .. i]$ is not grouped, then the codon $A[i-2 .. i]$ is a
          FS$^+$ codon, and the score of the alignment is:

          13. $D(i-1, j) + \text{fs\_open\_cost}$

   (c) If $B[j]$ is aligned with a gap, there are five cases that are symmetric
       to the five cases from (b):

       14. $\frac{s_{an}(A[i], B[j-1])}{2} + \frac{s_{an}(A[i-1], B[j-2])}{2} + D_F(i-2, j-3) + \text{fs\_open\_cost}$

       15. $s_{an}(A[i], B[j-1]) + D(i-1, j-3) + 2 * \text{fs\_open\_cost}$

       16. $\frac{s_{an}(A[i], B[j-2])}{2} + D_F(i-1, j-3) + \text{fs\_open\_cost}$

       17. $\text{gap\_cost} + D(i, j-3)$

       18. $D(i, j-1) + \text{fs\_open\_cost}$
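
The eighteen configurations of case 1 can be collected in a single routine that returns their scores, of which D(i, j) is the maximum. This is a sketch under stated assumptions and not the authors' implementation: the function name `case1_candidates` is hypothetical, the tables `D` and `DF` are assumed to be lists of lists indexed as `D[i][j]`, sequences are Python strings (so the 1-based nucleotide A[k] of the text is `A[k - 1]`), and the halved terms of configurations 9, 11, 14 and 16 follow the reading of Lemma 1 in which the undecided codon is looked up in D_F.

```python
def case1_candidates(D, DF, A, B, i, j, s_aa, s_an, gap_cost, fs_open_cost):
    """Scores of configurations 1 to 18 for i % 3 == 0 and j % 3 == 0 (i, j >= 3)."""
    def a(k):
        return A[k - 1]  # 1-based access, as in the text

    def b(k):
        return B[k - 1]

    fo = fs_open_cost
    return [
        s_aa(A[i - 3:i], B[j - 3:j]) + D[i - 3][j - 3],                               # 1
        s_an(a(i), b(j)) + s_an(a(i - 1), b(j - 1)) + D[i - 3][j - 2] + 2 * fo,       # 2
        s_an(a(i), b(j)) + s_an(a(i - 2), b(j - 1)) + D[i - 3][j - 2] + 2 * fo,       # 3
        s_an(a(i), b(j)) + D[i - 3][j - 1] + 2 * fo,                                  # 4
        s_an(a(i), b(j)) + s_an(a(i - 1), b(j - 1)) + D[i - 2][j - 3] + 2 * fo,       # 5
        s_an(a(i), b(j)) + s_an(a(i - 1), b(j - 2)) + D[i - 2][j - 3] + 2 * fo,       # 6
        s_an(a(i), b(j)) + D[i - 1][j - 3] + 2 * fo,                                  # 7
        s_an(a(i), b(j)) + D[i - 1][j - 1] + 2 * fo,                                  # 8
        s_an(a(i - 1), b(j)) / 2 + s_an(a(i - 2), b(j - 1)) / 2 + DF[i - 3][j - 2] + fo,  # 9
        s_an(a(i - 1), b(j)) + D[i - 3][j - 1] + 2 * fo,                              # 10
        s_an(a(i - 2), b(j)) / 2 + DF[i - 3][j - 1] + fo,                             # 11
        gap_cost + D[i - 3][j],                                                       # 12
        D[i - 1][j] + fo,                                                             # 13
        s_an(a(i), b(j - 1)) / 2 + s_an(a(i - 1), b(j - 2)) / 2 + DF[i - 2][j - 3] + fo,  # 14
        s_an(a(i), b(j - 1)) + D[i - 1][j - 3] + 2 * fo,                              # 15
        s_an(a(i), b(j - 2)) / 2 + DF[i - 1][j - 3] + fo,                             # 16
        gap_cost + D[i][j - 3],                                                       # 17
        D[i][j - 1] + fo,                                                             # 18
    ]
```

Keeping the candidates in a list, in the order of the proof, also makes it easy to record which configuration achieved the maximum when a traceback is needed.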


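Lemma 2 (Filling up table $D_F$).

1. If $i \bmod 3 = 0$ and $j \bmod 3 = 0$

   $D_F(i,j) = D(i,j)$

2. If $i \bmod 3 = 2$ and $j \bmod 3 = 0$

$$
D_F(i,j) = \max \begin{cases}
1.\ \frac{s_{aa}(A[i-1 .. i+1], B[j-1 .. j+1])}{2} + D_F(i-2, j-2) + \text{fs\_extension\_cost} \\
2.\ \frac{s_{an}(A[i+1], B[j+1])}{2} + s_{an}(A[i], B[j]) + D(i-2, j-1) + 2 * \text{fs\_open\_cost} \\
3.\ \frac{s_{an}(A[i+1], B[j+1])}{2} + \frac{s_{an}(A[i-1], B[j])}{2} + D_F(i-2, j-1) + \text{fs\_open\_cost} \\
4.\ \frac{s_{an}(A[i+1], B[j+1])}{2} + D(i-2, j) + \text{fs\_open\_cost} \\
5.\ \frac{s_{an}(A[i+1], B[j+1])}{2} + D(i, j) + \text{fs\_open\_cost}
\end{cases}
$$

3. If $i \bmod 3 = 0$ and $j \bmod 3 = 2$, the equation is symmetric to the
   previous case.

4. If $i \bmod 3 = 1$ and $j \bmod 3 = 0$

$$
D_F(i,j) = \max \begin{cases}
1.\ \frac{s_{aa}(A[i .. i+2], B[j .. j+2])}{2} + D_F(i-1, j-1) + \text{fs\_extension\_cost} \\
2.\ \frac{s_{an}(A[i+2], B[j+2])}{2} + \frac{s_{an}(A[i+1], B[j+1])}{2} + D(i-1, j) + \text{fs\_open\_cost} \\
3.\ \frac{s_{an}(A[i+2], B[j+2])}{2} + \frac{s_{an}(A[i+1], B[j+1])}{2} + D(i, j) + \text{fs\_open\_cost}
\end{cases}
$$

5. If $i \bmod 3 = 0$ and $j \bmod 3 = 1$, the equation is symmetric to the
   previous case.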

The proof of Lemma 2 follows from Lemma 1. It is given in the Appendix. We
now present the alignment algorithm using Lemmas 1 and 2 in the next theorem.

Theorem 1. Given two coding sequences A and B of lengths n and m, a maximum
score alignment between A and B can be found in time and space
$O(n \times m)$, using the following algorithm.


Algorithm Align(A, B)
  for i = 0 to n do
    D(i, 0) = floor(i / 3) * gap_cost
    D_F(i, 0) = D(i, 0) + s_an(A[i+1], B[1]) / 2 + s_an(A[i+2], B[2]) / 2 + fs_open_cost,  if i (mod 3) = 1
    D_F(i, 0) = D(i, 0) + s_an(A[i+1], B[1]) / 2 + fs_open_cost,                           if i (mod 3) = 2

  for j = 0 to m do
    D(0, j) = floor(j / 3) * gap_cost
    D_F(0, j) = D(0, j) + s_an(A[1], B[j+1]) / 2 + s_an(A[2], B[j+2]) / 2 + fs_open_cost,  if j (mod 3) = 1
    D_F(0, j) = D(0, j) + s_an(A[1], B[j+1]) / 2 + fs_open_cost,                           if j (mod 3) = 2

  for i = 1 to n do
    for j = 1 to m do
      compute D(i, j) using Lemma 1
      compute D_F(i, j) using Lemma 2 if i (mod 3) = 0 or j (mod 3) = 0

The proof of Theorem 1 is given in the Appendix.
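
The initialization of the first row and column of the two tables can be sketched in Python as follows. This is an illustration under explicit assumptions and not the authors' code: the function name `init_borders` is hypothetical, the border formulas are read as D(i, 0) = floor(i/3) * gap_cost with D_F adding half of the nucleotide scores of the one or two nucleotides that complete the current codon plus fs_open_cost, cells with i (mod 3) = 0 are assumed to take D_F = D as in case 1 of Lemma 2, and both sequences are assumed to contain at least one codon.

```python
def init_borders(A, B, s_an, gap_cost, fs_open_cost):
    """Build D and D_F and fill column 0 and row 0 (all other cells stay None)."""
    n, m = len(A), len(B)
    D = [[None] * (m + 1) for _ in range(n + 1)]
    DF = [[None] * (m + 1) for _ in range(n + 1)]

    for i in range(n + 1):
        D[i][0] = (i // 3) * gap_cost
        DF[i][0] = D[i][0]
        if i % 3 == 1:    # A[i+1], A[i+2] aligned with B[1], B[2] (1-based)
            DF[i][0] += s_an(A[i], B[0]) / 2 + s_an(A[i + 1], B[1]) / 2 + fs_open_cost
        elif i % 3 == 2:  # A[i+1] aligned with B[1]
            DF[i][0] += s_an(A[i], B[0]) / 2 + fs_open_cost

    for j in range(m + 1):
        D[0][j] = (j // 3) * gap_cost
        DF[0][j] = D[0][j]
        if j % 3 == 1:    # A[1], A[2] aligned with B[j+1], B[j+2]
            DF[0][j] += s_an(A[0], B[j]) / 2 + s_an(A[1], B[j + 1]) / 2 + fs_open_cost
        elif j % 3 == 2:  # A[1] aligned with B[j+1]
            DF[0][j] += s_an(A[0], B[j]) / 2 + fs_open_cost

    return D, DF
```

The remaining cells would then be filled row by row with the recurrences of Lemmas 1 and 2, which only look at cells with smaller or equal indices.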


4 Implementation 


We implemented the algorithm presented in this paper and the pairwise
alignment algorithm accounting for frameshift opening penalties described in
previous work.

We applied both algorithms to the alignment of the examples of coding
sequences Seq1, Seq2, and Seq3 described in Section 1 with the following
parameters: gap_cost = -1, fs_open_cost = -2, fs_extension_cost = -1,
$s_{aa}$ corresponding to the amino acid substitution matrix BLOSUM62, and
$s_{an}$ returning a score of +1 (resp. -1) for a match (resp. mismatch)
between two nucleotides. As predicted, the application of the algorithm with
frameshift opening penalties only to Seq1, Seq2 and Seq3 yields the same score
of 72.0 for both the alignment between Seq1 and Seq2, and the alignment between
Seq1 and Seq3. The present algorithm yields a score of 68.5 for Seq1 and Seq2,
and a lower score of 58.0 for Seq1 and Seq3.

Using the same parameters, both algorithms were also applied to pairs of
human coding sequences from paralogous genes that share a common coding
subsequence translated in different frames (see [8] for a list of 470 pairs of
human coding sequences presenting a frameshift event). In the Appendix, the
alignments obtained for the coding sequences NM_001083537 from gene FAM86B1
and NM_018172 from gene FAM86C1 are shown. These alignments show that both
coding sequences share a common prefix subsequence translated in the same
frame, and a common subsequence at the end of NM_018172 translated in
different frames, yielding a frameshift event. The algorithm with frameshift
opening penalties only yields a high score of 718.0 for the alignment, while
the present algorithm returns a score of 530, accounting for a frameshift
extension length of 81 nucleotides.

5 Conclusion

We introduce a new algorithm for the pairwise alignment of protein-coding
sequences, accounting for translation frameshift extensions and their
consequences on the modification of the protein sequences. The dynamic
programming algorithm has the same asymptotic space and time complexity as the
classical Needleman-Wunsch algorithm. The perspectives of this work include
the evaluation of the impact of the new method on the comparison of pairs of
coding sequences listed in biological databases. We also plan to study the
extension of the method in the context of multiple protein-coding sequence
alignment.

Appendix

Complete proof of Lemma 1

Proof (Complete proof of Lemma 1). An illustration of the different
configurations of alignment considered for cases 1 and 2 of Lemma 1 in this
proof is given in Figure 1. For each of the cases 1, 2, 3 and 4 of the Lemma,
we first consider three cases depending on the configurations of the alignment
of $A[i]$ and $B[j]$: (a) $A[i]$ and $B[j]$ are aligned together, (b) $A[i]$ is
aligned with a gap, (c) $B[j]$ is aligned with a gap.

1. If i{mod 3) = 0 and j{mod 3) = 0, then A[i] and B[j] are the last nucleotides 
of two codons A[i — 2..i\ and B\j — 2 .. j]. There are three cases depending 
on the alignment of A[i\ and B[j]. 

(a) If A\i\ and B\j] are aligned together, there are four cases depending 
on whether A[i — 2 .. i] and B[j — 2 .. j] are grouped in the alignment 
or not. 

i. If both A[i — 2 .. i] and B[j — 2 .. j] are grouped, then —2 .. i] 
and B[j — 2 .. j] have to be aligned together, and the score of the 
alignment is: 

1. Saa{A[i - 2 .. i],B[j - 2 .. j]) + D{i - 3, J - 3) 

ii. If A\i — 2 .. i] is grouped while B[j — 2 .. j] is not grouped, 
then both A[i — 2..i] and B[j — 2 .. j] are FS codons {A[i — 2..i] 
is a FS“ codon while B[j — 2 .. j] is a FS+ codon). We add 2 * 
fs_open_cost to the score of the alignment, and the alignment of 
the nucleotides of the two FS codons can be scored independently 
using the scoring function San- There are two cases depending on 
the number of nucleotides from B\j — 2 .. j] that are aligned with 
A[i — 2 .. i], two or one: 

A. If A[i — 2 .. i] is aligned with two nucleotides, then these 
nucleotides are B[j — 1] and B[j]. There are two cases depending 
on the alignment of the nucleotide B[j — 1] with A[z—1] or A[i—2]: 

2. San{A[i], B\j]) + San{.A[i - 1], B[j - 1]) -|- D{i - 3, j - 2) -I- 2 * 
fs_open_cost 

3. San{A[i],B[j\) + San{A[i - 2], B[j - 1]) -f D{i - 3, j -2)+ 2* 
fs_open_cost 

B. If A[i — 2 .. i] is aligned with one nucleotide, then this single 
nucleotide is B[j], and the score of the alignment is: 

4. Sart(A[i], B[j]) -\- D{i — 3, j — 1) + 2 * f s_open_cost 

hi. If A[i —2 .. i] is not grouped while B[j — 2 .. j] is grouped, there 
are three cases that are symmetric to the three cases from (a)ii.: 

5. San{A[i\,B\j]) + San{A[i - 1], B\j - 1]) -|- D(i - 2,j - 3) -I- 2 * 
fs_open_cost 

6. SaniA[l],B\j]) + San{A[l - 1], B[j - 2]) + D{l - 2,j - 3) + 2 * 
fs_open_cost 

7. SaniA[i], B[j]) + D(i — 1, j — 3) + 2 * fs_open_cost 
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iv. If both A[i — 2 .. i] and B[j — 2 .. j] are not grouped, then again 
both A[i — 2 .. i] and B[j — 2 .. j] are FS codons (both are FS+ 
codons): 

8. San(,A[i], B[j]) + D{i — 1, j — 1) + 2 * f s_open_cost 

(b) If A[i] is aligned with a gap, then the codon A[i-2 .. i] is a FS codon
(FS- or FS+). We must add fs_open_cost to the score of the alignment.
There are two cases depending on whether A[i-2 .. i] is grouped in the
alignment or not.

i. If A[i-2 .. i] is grouped, then there are three cases depending on
the number of nucleotides from B[j-2 .. j] that are aligned with
A[i-2 .. i], two, one, or zero.

A. If A[i-2 .. i] is aligned with two nucleotides, then these
nucleotides are B[j-1] and B[j]. The score of the alignment is:

9. s_an(A[i-1], B[j])/2 + s_an(A[i-2], B[j-1])/2 + D_F(i-3, j-2) + fs_open_cost

B. If A[i-2 .. i] is aligned with one nucleotide, then this single
nucleotide is B[j]. There are two cases depending on the alignment
of the nucleotide B[j] with A[i-1] or A[i-2]:

10. s_an(A[i-1], B[j]) + D(i-3, j-1) + 2 * fs_open_cost

11. s_an(A[i-2], B[j])/2 + D_F(i-3, j-1) + fs_open_cost

C. If A[i-2 .. i] is aligned with zero nucleotides, then the codon
A[i-2 .. i] is entirely deleted. The score of the alignment is:

12. gap_cost + D(i-3, j)

ii. If A[i-2 .. i] is not grouped, then the codon A[i-2 .. i] is a FS+
codon, and the score of the alignment is:

13. D(i-1, j) + fs_open_cost


(c) If B[j] is aligned with a gap, there are five cases that are symmetric
to the five cases from (b):


14. s_an(A[i], B[j-1])/2 + s_an(A[i-1], B[j-2])/2 + D_F(i-2, j-3) + fs_open_cost

15. s_an(A[i], B[j-1]) + D(i-1, j-3) + 2 * fs_open_cost

16. s_an(A[i], B[j-2])/2 + D_F(i-1, j-3) + fs_open_cost


17. gap_cost + D(i, j-3)

18. D(i, j-1) + fs_open_cost


2. If i (mod 3) = 0 and j (mod 3) ≠ 0, then A[i] is the last nucleotide of a
codon A[i-2 .. i] and B[j] is not the last nucleotide of a codon. There are
three cases depending on the alignment of A[i] and B[j].

(a) If A[i] and B[j] are aligned together, there are two cases depending
on whether A[i-2 .. i] is grouped in the alignment or not.

i. If A[i-2 .. i] is grouped, there are three cases depending on the
number of nucleotides from B that are aligned with A[i-2 .. i],
three, two, or one:

A. If A[i-2 .. i] is aligned with three nucleotides, then these
nucleotides are B[j], B[j-1], and B[j-2]. We are in the case
of an unmatching (U) codon. The score of the alignment is then:

1. s_aa(A[i-2 .. i], B[j-2 .. j])/2 + D(i-3, j-3) + fs_extension_cost

B. If A[i-2 .. i] is aligned with two nucleotides, then these
nucleotides are B[j] and B[j-1]. A[i-2 .. i] is a FS- codon.
There are two cases depending on the alignment of B[j-1] with
A[i-1] or A[i-2]. In both cases, if j-1 (mod 3) = 0, then j-1 is
the last nucleotide of a codon. We should then make adjustments
in order to account for the type of this codon (FS+, or unknown
type for now):

2. s_an(A[i], B[j]) + s_an(A[i-1], B[j-1]) + D(i-3, j-2) + fs_open_cost (+ fs_open_cost if j-1 (mod 3) = 0)

3. s_an(A[i], B[j]) + s_an(A[i-2], B[j-1]) + D_F(i-3, j-2) + fs_open_cost (- s_an(A[i-2], B[j-1])/2 if j-1 (mod 3) = 0)

C. If A[i-2 .. i] is aligned with one nucleotide, then A[i-2 .. i]
is a FS- codon. The score of the alignment is:

4. s_an(A[i], B[j]) + D(i-3, j-1) + fs_open_cost

ii. If A[i-2 .. i] is not grouped, then A[i-2 .. i] is a FS+ codon:

5. s_an(A[i], B[j]) + D(i-1, j-1) + fs_open_cost

(b) If A[i] is aligned with a gap, there are two cases depending on whether
A[i-2 .. i] is grouped in the alignment or not.

i. If A[i-2 .. i] is grouped, there are three cases depending on the
number of nucleotides from B that are aligned with A[i-2 .. i], two,
one, or zero.

A. If A[i-2 .. i] is aligned with two nucleotides, then these
nucleotides are B[j] and B[j-1]. A[i-2 .. i] is a FS- codon. If
j-1 (mod 3) = 0, then j-1 is the last nucleotide of a codon.
We should make adjustments in order to account for the fact that no
type has yet been decided for this codon.

6. s_an(A[i-1], B[j]) + s_an(A[i-2], B[j-1]) + D_F(i-3, j-2) + fs_open_cost (- s_an(A[i-2], B[j-1])/2 if j-1 (mod 3) = 0)

B. If A[i-2 .. i] is aligned with one nucleotide, then this single
nucleotide is B[j]. A[i-2 .. i] is a FS- codon. There are two
cases depending on the alignment of B[j] with A[i-1] or A[i-2]:

7. s_an(A[i-1], B[j]) + D(i-3, j-1) + fs_open_cost

8. s_an(A[i-2], B[j]) + D(i-3, j-1) + fs_open_cost

C. If A[i-2 .. i] is aligned with zero nucleotides, the codon
A[i-2 .. i] is entirely deleted:

9. gap_cost + D(i-3, j)

ii. If A[i-2 .. i] is not grouped:

10. D(i-1, j) + fs_open_cost

(c) If B[j] is aligned with a gap, then the score of the alignment is: 

11. D(i, j-1)

3. If i (mod 3) ≠ 0 and j (mod 3) = 0, the proof is symmetric to the previous
proof for 2.

4. If i (mod 3) ≠ 0 and j (mod 3) ≠ 0, there are three cases depending on the
alignment of A[i] and B[j].

(a) If A[i] and B[j] are aligned together, the score of the alignment is:

1. s_an(A[i], B[j]) + D(i-1, j-1)

(b) If A[i] is aligned with a gap, the score of the alignment is:

2. D(i-1, j)

(c) If B[j] is aligned with a gap, the score of the alignment is:

3. D(i, j-1)
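
The recurrences above can be sketched in code. The following is a minimal Python illustration of two of them: Case 4 (three candidates) and Case 1(a) (candidates 1 to 8, which only read the table D). It is a sketch under stated assumptions, not the authors' implementation: the function names, the toy scoring functions `toy_s_an` and `toy_s_aa`, and the negative value used for `fs_open_cost` are hypothetical and chosen only for illustration. Positions i and j are 1-based as in the text, so A[i] is `A[i - 1]` in Python.

```python
from typing import Callable, List

Table = List[List[float]]


def case4_score(D: Table, A: str, B: str, i: int, j: int,
                s_an: Callable[[str, str], float]) -> float:
    """Lemma 1, Case 4 (i mod 3 != 0 and j mod 3 != 0), for 1-based i, j >= 1."""
    return max(
        s_an(A[i - 1], B[j - 1]) + D[i - 1][j - 1],  # item 1: A[i] aligned with B[j]
        D[i - 1][j],                                 # item 2: A[i] aligned with a gap
        D[i][j - 1],                                 # item 3: B[j] aligned with a gap
    )


def case1_aligned_candidates(D: Table, A: str, B: str, i: int, j: int,
                             s_aa: Callable[[str, str], float],
                             s_an: Callable[[str, str], float],
                             fs_open_cost: float) -> List[float]:
    """Lemma 1, Case 1(a), items 1-8: A[i] aligned with B[j], i and j multiples of 3, >= 3."""
    a = lambda k: A[k - 1]          # 1-based access, as in the text
    b = lambda k: B[k - 1]
    two_fs = 2 * fs_open_cost       # both codons are FS codons in items 2-8
    last = s_an(a(i), b(j))         # score of the column (A[i], B[j])
    return [
        s_aa(A[i - 3:i], B[j - 3:j]) + D[i - 3][j - 3],              # item 1
        last + s_an(a(i - 1), b(j - 1)) + D[i - 3][j - 2] + two_fs,  # item 2
        last + s_an(a(i - 2), b(j - 1)) + D[i - 3][j - 2] + two_fs,  # item 3
        last + D[i - 3][j - 1] + two_fs,                             # item 4
        last + s_an(a(i - 1), b(j - 1)) + D[i - 2][j - 3] + two_fs,  # item 5
        last + s_an(a(i - 1), b(j - 2)) + D[i - 2][j - 3] + two_fs,  # item 6
        last + D[i - 1][j - 3] + two_fs,                             # item 7
        last + D[i - 1][j - 1] + two_fs,                             # item 8
    ]


# Hypothetical toy scoring functions, for illustration only.
def toy_s_an(x: str, y: str) -> float:
    return 1.0 if x == y else -1.0


def toy_s_aa(codon_a: str, codon_b: str) -> float:
    return 3.0 if codon_a == codon_b else -3.0
```

With an all-zero table D and identical codons, candidate 1 (the codon match) dominates the frameshift candidates, each of which pays 2 * fs_open_cost.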


Proof of Lemma 2

Proof (Proof of Lemma 2). The proof follows from Lemma 1.

1. If i (mod 3) = 0 and j (mod 3) = 0, this case is trivial.

2. If i (mod 3) = 2 and j (mod 3) = 0, then i+1 (mod 3) = 0 and j+1 (mod 3) =
1 ≠ 0. The five cases follow from the application of Lemma 1, Case 2 for
computing D(i+1, j+1), and by keeping only the cases where A[i+1]
and B[j+1] are aligned together (cases 1, 2, 3, 4, 5 among the 11 cases).
However, in each of the cases, we must subtract half of the score of aligning
B[j+1] with A[i+1], that is s_an(A[i+1], B[j+1])/2, because this score will
be added subsequently.

3. If i (mod 3) = 0 and j (mod 3) = 2, the proof is symmetric to the previous
case.

4. If i (mod 3) = 1 and j (mod 3) = 0, then i+2 (mod 3) = 0 and j+2 (mod 3) =
2 ≠ 0. Here again, the three cases follow from the application of Lemma 1,
Case 2 for computing D(i+2, j+2), and by keeping only the cases where
A[i+1], B[j+1], and A[i+2], B[j+2] can be aligned together (cases
1, 2, 5 among the 11 cases). However, in each of the cases, we must subtract
half of the scores of aligning B[j+2] with A[i+2], and aligning B[j+1] with
A[i+1], that is (s_an(A[i+2], B[j+2]) + s_an(A[i+1], B[j+1]))/2, because
these scores will be added subsequently.

5. If i (mod 3) = 0 and j (mod 3) = 1, the proof is symmetric to the previous
case.
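
Read literally, cases 2 and 4 of this proof amount to the following two identities. This is only a compact restatement of the argument above, not the statement of Lemma 2 itself: the notation c_k(x, y), denoting the k-th of the 11 candidates of Lemma 1, Case 2 evaluated at position (x, y), is introduced here for illustration.

```latex
% c_k(x, y): k-th candidate of Lemma 1, Case 2, evaluated at (x, y).
% This notation is an assumption made for this restatement.
\begin{align*}
i \bmod 3 = 2,\ j \bmod 3 = 0: \quad
  D_F(i,j) &= \max_{k \in \{1,\dots,5\}} c_k(i+1,\, j+1)
              \;-\; \frac{s_{an}(A[i+1], B[j+1])}{2} \\
i \bmod 3 = 1,\ j \bmod 3 = 0: \quad
  D_F(i,j) &= \max_{k \in \{1,2,5\}} c_k(i+2,\, j+2)
              \;-\; \frac{s_{an}(A[i+2], B[j+2]) + s_{an}(A[i+1], B[j+1])}{2}
\end{align*}
```

Since the same half-score is subtracted from every retained candidate, subtracting it once after taking the maximum is equivalent to subtracting it inside each case.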


Proof of Theorem 1

Proof (Proof of Theorem 1). The proof relies on two points: (1) the algorithm
computes the maximum score of an alignment between A and B, and (2) the
algorithm runs with an O(nm) time and space complexity.

(1) The validity of the algorithm, i.e. the fact that it fills the cells of the tables
D and D_F according to their definition, follows from five points.

- The initialization of the tables is a direct consequence of their definition.

- Lemmas 1 and 2.

- The couples (i,j) of prefixes of A and B that need to be considered in the
algorithm are all the possible couples for D(i,j), and only the couples such
that i (mod 3) = 0 or j (mod 3) = 0 for D_F(i,j) (see all the cases in which
the table D_F is used in Lemmas 1 (7 cases) and 2 (3 cases)).

- The couples (i,j) of prefixes of A and B are considered in increasing order of
length, and D(i,j) is computed before D_F(i,j) in the cases where i (mod 3) =
0 or j (mod 3) = 0.

- Backtracking through the tables yields a maximum score alignment
between A and B.

(2) The time and space complexity of the algorithm is a direct consequence of
the number of cells of the tables D and D_F, 2 × (n+1) × (m+1). Each cell is
filled in constant time.
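
The traversal order and the cell count used in this proof can be illustrated with a short Python sketch. It is only an illustration under assumptions: the names `fill_order` and `allocated_cells` are hypothetical, and a row-major sweep is one possible way to visit prefixes in increasing order of length; the cell computations themselves (Lemmas 1 and 2) are omitted.

```python
from typing import Iterator, Tuple


def fill_order(n: int, m: int) -> Iterator[Tuple[str, int, int]]:
    """Yield (table, i, j) in a row-major order compatible with the proof.

    Every cell (i', j') with i' <= i and j' <= j is visited before (i, j),
    D(i, j) is visited before D_F(i, j), and D_F(i, j) is only visited
    when i mod 3 = 0 or j mod 3 = 0.
    """
    for i in range(n + 1):
        for j in range(m + 1):
            yield ("D", i, j)
            if i % 3 == 0 or j % 3 == 0:
                yield ("D_F", i, j)


def allocated_cells(n: int, m: int) -> int:
    """Number of cells of the two (n+1) x (m+1) tables D and D_F."""
    return 2 * (n + 1) * (m + 1)


if __name__ == "__main__":
    order = list(fill_order(3, 3))
    print(len(order), allocated_cells(3, 3))
```

Each visited cell is filled in constant time, so the number of visits is bounded by the number of allocated cells, which gives the O(nm) bound.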


Alignment of coding sequences NM_001083537 and NM_018172 using
a previously published method [11] and the present method

Translations of NM_001083537 and NM_018172 into protein sequences
>NM_001083537 

MAPEENAGTELLLQGFERRFLAVRTLRSFPWQSLEAKLRDSSDSELLRDILQKTVRHPVC 

VKHPPSVKYAWCFLSELIKKSSGGSVTLSKSTAIISHGTTGLVTWDAALYLAEWAIENPA 

AFINRTVLELGSGAGLTGLAICKMCRPRAYIFSDPHSRVLEQLRGNVLLNGLSLEADITG 

NLDSPRVTVAQLDWDVAMVHQLSAFQPDVVIAADVLYCPEAIVSLVGVLQRLAACREHKR 

APEVYVAFTVRNPETCQLFTTELGRDGIRWEAEAHHDQKLFPYGEHLEMAMLNLTL* 

>NM_018172 

MAPEEMGSELLLQSFKRRFLAARALRSFRWQSLEAKLRDSSDSELLRDILQKHEAVHTE 

PLDELYEVLVETLMAKESTQGHRSYLLTCCIAQKPSCRWSGSCGGWLPAGSTSGLLNSTW 

PLPSATQRCASCSPPSYAGLGSDGKRKLIMTRNCFPTESTWRWQS* 
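
The text does not say how these translations were produced. As a minimal sketch, assuming the standard genetic code and reading the coding sequence in frame from its first nucleotide, a translation can be computed as follows. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# for the first, second and third codon positions ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[k:k + 3]] for k in range(0, usable, 3))


if __name__ == "__main__":
    # First 60 nucleotides of NM_001083537 as printed in the alignment listing.
    prefix = "ATGGCGCCCGAGGAGAACGCGGGGACCGAACTCTTGCTGCAGGGTTTTGAGCGCCGCTTC"
    print(translate(prefix))
```

Applied to the first 60 nucleotides of NM_001083537 as they appear in the alignment listing, this gives MAPEENAGTELLLQGFERRF, which agrees with the start of the protein sequence above.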


Score obtained with previously published method: 718.0

A!!TGGCGCCCGAGGAGAACGCGGGGACCGAACTCTTGCTGCAGGGTTTTGAGCGCCGCT 
A!!TGGCGCCCGAGGAGAACGCGGGGAGCGAACTCTTGCTGCAGAGTTTCAAGCGCCGCT 

TCCTGG—CGGTGCGCACACTGCGCTCCTTC ! CCC—TGGCAGAGCTTAGAGGCAAAG 
TCCTGGCAGCGC—GCGCCCTGCGCTCCTT! ! CCGC ! ! TGGCAGAGCTTAGAAGCAAAG 

TTAAGAGACT! ! CATCAGATTCTGAGCTGCTGCGGGATATTTTGCAGAAGACTGTGAGGC 
TTAAGAGACT! ! CATCAGATTCTGAGCTGCTGCGGGATATTTTGC—AGA- 

ATCCTGTGTGTGTGAAGCACCCGCCG!TCAGTCAAGTATGCCTGGTGCT!!TTCTCTCAG 
-AGCACGAGGC! !-TGT- 

AACTCATCAAAAAGTCCTCAGGAGGCTCAGTCACACTCTCCAAGAGCACAGCCATCATCT 
-CCA-CAC-AGAG! !—CCTT! ! — 

CCCACGGTACCACAGGCCTGGTCACATGGGATGCCGCCCTCTA!!CCTTGCAGAATGGGC 
-XGG-ATG-AGC-TGT—ACG— 

CATCGAGAACCCGGCAGCCTTCATTAACAGGACTGTCCTAGAGCTTGGCAGTGGTGCCGG 
—AGG-TGC-TGG—TGG—AGA 

CCTCACAGGCCTTGCCATCTGCAAGATGTGCCGCCCCCGGGCATACATCTTCAGCGACCC 
C! !-CCT-GAT-GGC- 

TCACAGCCGGGTCCTCGAGCAGCTCCGAGGGAATGTCCTTCTCAATGGCCTCTCATTAGA 
—CAA—GGA-GTC— 

GGCAGACATCACTGGCAACTTAGACAGCCCCAGGGTGACAGTGGCCCAGCTGGACTGGGA 
-CAC-CCA-GGGCCA- 

CGTAGCAATGGTCCAT! ! CAGCTCTCTGCCTTCCAGCCAGATGTTGTCATTGCAGCAGAC 
-CCG—GAGCTA-TTT-GCTGAC 

G!!TGCTGTATTGCCCAGAAGCCATCGTGTCGCTGGTCGGGGTCCTGCAGAGGCTGGCTG 
G!!TGCTGTATTGCCCAGAAGCCATCGTGTCGCTGGTCGGGGTCCTGCGGAGGCTGGCTG 

CCTGCCGGGAGCACAAGCGGGCTCCTGAGGTCTACGTGGCCTTTACCGTCCG! ! CAACCC 
CCTGCCGGGAGCACCAGCGGGCTCCTCAATTCTACATGGCCCTTACCGTCTG!!CAACCC 

AGAGACGTGCCAGCTGTTCACCACCGAGCTAG!GCCGGGA!!TGGGATC!!AGATGGGAA 
AGAGATGTGCCAGCTGTTCACCACCGAGCTAT!GCTGGAC!!TGGGATC!!AGATGGGAA 

GCGGAAGCTCATCATGACCAGAAACTGTTTCCCTATG!!GAGAGCACTTGGAGATGGCAA 
GCGGAAGCTCATCATGACCAGAAACTGTTTCCCTACA!!GAGAGCACTTGGAGATGGCAA 

TGCTGAACCTCACACTGTAG! 

-AGC-TGA 


Score obtained with present method: 530.0 

ATGGCGCCCGAGGAGAACGCGGGGACCGAACTCTTGCTGCAGGGTTTTGAGCGCCGCTTC 

ATGGCGCCCGAGGAGAACGCGGGGAGCGAACTCTTGCTGCAGAGTTTCAAGCGCCGCTTC 

CTGGCGGTGCGCACACTGCGCTCCTTCCCCTGGCAGAGCTTAGAGGCAAAGTTAAGAGAC 

CTGGCAGCGCGCGCCCTGCGCTCCTTCCGCTGGCAGAGCTTAGAAGCAAAGTTAAGAGAC 

TCATCAGATTCTGAGCTGCTGCGGGATATTTTGCAGAAGACTGTGAGGCATCCTGTGTGT 
TCATCAGATTCTGAGCTGCTGCGGGATATTTTGCAG- 

GTGAAGCACCCGCCGTCAGTCAAGTATGCCTGGTGCTTTCTCTCAGAACTCATCAAAAAG 
—AAGCAC-GAG- 

TCCTCAGGAGGCTCAGTCACACTCTCCAAGAGCACAGCCATCATCTCCCACGGTACCACA 
-GCT—GTC—CAC—ACA-GA 

GGCCTGGTCACATGGGATGCCGCCCTCTACCTTGCAGAATGGGCCATCGAGAACCCGGCA 
G-CCT-TTG-GAT—GAGCTGTAC-GAG-GTG- 

GCCTTCATTAACAGGACTGTCCTAGAGCTTGGCAGTGGTGCCGGCCTCACAGGCCTTGCC 
-CTG-GTG—GAG-ACC—CTG— 

ATCTGCAAGATGTGCCGCCCCCGGGCATACATCTTCAGCGACCCTCACAGCCGGGTCCTC 
-ATG-GCC- 

GAGCAGCTCCGAGGGAATGTCCTTCTCAATGGCCTCTCATTAGAGGCAGACATCACTGGC 
-AAG-GAG- 

AACTTAGACAGCCCCAGGGTGACAGTGGCCCAGCTGGACTGGGACGTAGCAATGGTCCAT 
-TCC-ACC-CAG-GGC-CAC 

CAGCTCTCTGCCTTCCAGCCAGATGTTGTCATTGCAGCAGACGTGCTGTATTGCCCAGAA 
CGG—AGC—TAT-TTGCT—GACGTGCTGTATTGCCCAGAA 

GCCATCGTGTCGCTGGTCGGGGTCCTGCAGAGGCTGGCTGCCTGCCGGGAGCACAAGCGG 

GCCATCGTGTCGCTGGTCGGGGTCCTGCGGAGGCTGGCTGCCTGCCGGGAGCACCAGCGG 

GCTCCTGAGGTCTACGTGGCCTTTACCGTCCGCAACCCAGAGACGTGC—CAGCTGTTCA 
GCTCCTCAATTCTACATGGCCCTTACCGTCTGCAACCCAGAGA—TGTGCCAGCTGTTCA 

CCACCGAGCTA-GGCCGGGATGGGATCAGATGGGAAGCGGAAGCTCATCATGACCAG 

CCACCGAGCTATGCTGGA-CTGGGATCAGATGGGAAGCGGAAGCTCATCATGACCAG 

AAACTGTTTCCCTATGGAGAGCACTTGGAGATGGCAATGCTGAACCTCACACTGTAG 
AAACTGTTTCCCTACAGAGAGCACTTGGAGATGGCAA-AGC—TGA 

